Before you begin this week's worksheet, please reinstall
DigitalMethodsData from GitHub by running
devtools::install_github("regan008/DigitalMethodsData") in
your console. Also make sure that you have installed the Tidyverse.
R has powerful tools for manipulating data. The Tidyverse is a collection of packages for R that are designed for data science. Take a look at the website for the Tidyverse and the list of packages that are included at: https://www.tidyverse.org/packages/
dplyr
We’ll start with dplyr, which is described as “a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.” The verbs included in this package are:
select(): picks variables based on their names.
mutate(): adds new variables that are functions of existing variables.
filter(): picks cases based on their values.
summarise(): reduces multiple values down to a single summary.
arrange(): changes the ordering of the rows.
All of these verbs play nicely and combine naturally with
group_by() which allows you to perform any operation “by
group”.
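To make the “by group” idea concrete, here is a minimal sketch that combines group_by() with summarise(). The places data frame and its values are hypothetical; only the column names are modeled on the ones used later in this worksheet.

```r
library(dplyr)

# Hypothetical toy data; it only mimics the style of the worksheet's columns.
places <- data.frame(
  title = c("Bar A", "Bar B", "Cafe C", "Club D"),
  state = c("SC", "SC", "CA", "CA"),
  Year  = c(1975, 1980, 1975, 1985)
)

# Count the entries and find the earliest year within each state.
by_state <- places %>%
  group_by(state) %>%
  summarise(entries = n(), first_year = min(Year))

print(by_state)
```

Each state ends up as one row, because summarise() collapses every group down to a single summary row.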
Let's load some data and libraries for our work.
library(DigitalMethodsData)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(magrittr)
data("gayguides")
Let's start with select(). This function allows you to
subset columns using their names and types.
gayguides %>%
  select(title, Year)
Notice that this subsetted the data and returned only the title and
year. However, it didn’t modify the gayguides data or save
it to a new variable because we didn’t assign the result to
anything.
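The difference between printing a result and saving it can be sketched on a small example. The guides data frame below is a hypothetical stand-in for gayguides, and titles_years is just an illustrative name.

```r
library(dplyr)

# Hypothetical stand-in for gayguides with only three columns.
guides <- data.frame(
  title = c("Bar A", "Cafe C"),
  city  = c("Greenville", "San Diego"),
  Year  = c(1975, 1980)
)

# Without assignment, the result is printed and then discarded.
guides %>% select(title, Year)

# With assignment, the subset is stored under a new name.
titles_years <- guides %>% select(title, Year)

ncol(guides)        # the original still has all three columns
ncol(titles_years)  # the saved subset has two
```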
Use select() to take the city and state from gayguides
and add them to a dataframe called “locations”.
locations <- gayguides %>%
  select(city, state) %>%
  data.frame()
I assigned the result of select() and data.frame() to a new variable named “locations”. Without the assignment, the result would be printed but not saved in the Environment. The data.frame() function builds a new dataframe from the columns returned by select(), although select() already returns a dataframe, so that step is not strictly needed.
Can you use select() to grab all the columns of
gayguides EXCEPT for the city and state? Hint: You might
want to read the documentation for this function.
gayguides %>%
  select(!c(city, state))
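There are several ways to spell "everything except these columns" in select(). The sketch below shows three that should give the same result on a hypothetical one-row data frame; guides and the drop_* names are illustrative only.

```r
library(dplyr)

# Hypothetical one-row stand-in for gayguides.
guides <- data.frame(
  title = "Bar A", city = "Greenville", state = "SC", Year = 1975
)

# Complement of a set of columns.
drop_bang <- guides %>% select(!c(city, state))

# Negating each column separately.
drop_minus <- guides %>% select(-city, -state)

# Negating a combined selection.
drop_minus_c <- guides %>% select(-c(city, state))

names(drop_bang)
```

All three keep title and Year, so the choice is mostly a matter of which form reads best to you.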
The filter function subsets a data frame and retains all the rows that satisfy your conditions. To be retained, the row must produce a value of TRUE for all of the conditions you provide.
gayguides %>% filter(Year > 1980)
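When filter() receives several conditions separated by commas, a row has to satisfy all of them. A minimal sketch on hypothetical data (guides is a made-up stand-in for gayguides):

```r
library(dplyr)

# Hypothetical stand-in for gayguides.
guides <- data.frame(
  title = c("Bar A", "Bar B", "Cafe C"),
  city  = c("Greenville", "Greenville", "San Diego"),
  state = c("SC", "NC", "CA"),
  Year  = c(1975, 1982, 1985)
)

# Conditions separated by commas are combined with AND.
with_commas <- guides %>% filter(city == "Greenville", Year > 1980)

# The same filter written with an explicit &.
with_and <- guides %>% filter(city == "Greenville" & Year > 1980)

with_commas
```

Only the Greenville entry from after 1980 is kept in both versions.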
filter() also works with the logical operators we learned earlier this semester.
gayguides %>% filter(Year == 1970 | Year == 1980)
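When the same column is compared against several values, %in% is a common shorthand for chaining == with |. A small sketch on hypothetical data:

```r
library(dplyr)

# Hypothetical stand-in for gayguides.
guides <- data.frame(
  title = c("A", "B", "C", "D"),
  Year  = c(1970, 1975, 1980, 1980)
)

# Explicit OR between two equality tests.
either_or <- guides %>% filter(Year == 1970 | Year == 1980)

# The same rows selected with %in%.
with_in <- guides %>% filter(Year %in% c(1970, 1980))

with_in
```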
And strings:
gayguides %>%
  filter(city == "Greenville")
gayguides %>%
  filter(city == 'Greenville' & state == 'SC')
gayguides %>%
  filter(Year > 1975 & Year < 1980)
(did you mean every “entry”?)
gayguides %>%
  filter(city == 'Greenville' & state == 'SC') %>%
  filter(Year > 1975 & Year < 1980)
test.1975NOTnycsf <- gayguides %>%
  filter(Year == 1975 & city != 'New York') %>%
  filter(city != 'San Francisco')
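# fixed = TRUE matches the literal text "(G)"; read as a regular expression,
# the parentheses form a group and the pattern would match any "G".
filter(gayguides, grepl('(G)', amenityfeatures, fixed = TRUE))
filter(gayguides, grepl('(L)', amenityfeatures, fixed = TRUE))
I could only figure out how to use grepl() to get two different dataframes, one for (G) and the other for (L). I couldn’t get both results in a single dataframe.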
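One possible way to get both sets of rows in a single dataframe is to join the two grepl() tests with |, or to write one regular expression that covers both codes. The sketch below uses a hypothetical sample of amenityfeatures values rather than the real data.

```r
library(dplyr)

# Hypothetical sample of amenityfeatures values.
guides <- data.frame(
  title = c("Bar A", "Bar B", "Cafe C", "Club D"),
  amenityfeatures = c("(G),(D)", "(L)", "(M)", "Great Lunch")
)

# Two literal matches combined with OR.
g_or_l <- guides %>%
  filter(grepl("(G)", amenityfeatures, fixed = TRUE) |
           grepl("(L)", amenityfeatures, fixed = TRUE))

# The same idea as one regular expression with escaped parentheses.
g_or_l_regex <- guides %>%
  filter(grepl("\\((G|L)\\)", amenityfeatures))

g_or_l
```

The "Great Lunch" row is left out in both versions, because it contains the letters G and L but not the codes (G) or (L).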
The mutate() function adds new variables and preserves
existing ones. This is useful when you want to create a new column based
on other values. For example, in the statepopulations
dataset, we want to ask “How much did the population increase between
1800 and 1900 in each state?” We can do that by subtracting the
population in 1800 from the population in 1900 and storing that value in a new column.
data("statepopulations")
statepopulations %>% mutate(difference = X1900 - X1800)
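To see the arithmetic on numbers that are easy to check by hand, here is a sketch with made-up figures. The pops data frame, its STATE column, and the pct_change column are assumptions for illustration and are not taken from statepopulations.

```r
library(dplyr)

# Hypothetical figures, not real census counts.
pops <- data.frame(
  STATE = c("State A", "State B"),
  X1800 = c(1000, 2500),
  X1900 = c(4000, 3000)
)

# mutate() keeps the existing columns and appends the new ones.
# A column created earlier in the same call can be reused right away.
growth <- pops %>%
  mutate(difference = X1900 - X1800,
         pct_change = 100 * difference / X1800)

growth
```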
data("BostonWomenVoters")
BostonWomenVoters %>%
  mutate(birth.year = 1920 - Age)
How would you combine the city and state columns of
gayguides into a new column called location? It should
list the city, state. (e.g. San Diego, CA)
gayguides %>%
  mutate(location = paste(city, state, sep = ", "))
arrange() orders the rows of a data frame by the values
of selected columns. In other words, it sorts a data frame by a variable.
In the gayguides data, we can sort the data by year with
the earliest year first. If we wanted the latest year first, we could do
so by using the desc() function.
gayguides %>%
  arrange(Year)
gayguides %>%
  arrange(desc(Year))
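arrange() also accepts more than one column, in which case later columns break ties in earlier ones. A minimal sketch on hypothetical data:

```r
library(dplyr)

# Hypothetical stand-in for gayguides.
guides <- data.frame(
  title = c("Bar A", "Bar B", "Cafe C", "Club D"),
  state = c("SC", "CA", "SC", "CA"),
  Year  = c(1980, 1975, 1975, 1980)
)

# Sort by state alphabetically, then by latest year first within each state.
sorted <- guides %>%
  arrange(state, desc(Year))

sorted
```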